Final Project
Final Project
Author

Niharika Pola

Published

December 13, 2022

Code
library(tidyverse)
library(ggplot2)
library(readr)
library(corrplot)
library(reshape2)
library(corrgram)
library(ggpubr)
library(caTools)
library(GGally)
library(stats)
library(ggsci)

Happiness Explained: Through Multiple Linear Regression.

Background/Motivation

What is takes for a country or a continent to be happy or sad? Is it the higher GDP per capita, lower Population growth, lower Suicide rate or Urbanization? What are the factors that affect a country’s or continents overall happiness? Can we predict the happiness score? The curiosity to find answers to these questions made me explore the world happiness scores and world data by country.

The World Happiness Report is a landmark survey of the state of global happiness . The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

The data sets that are used in this analysis are 1) world happiness report 2021 and 2) kaggle’s world data by country 2020. Happiness scores from first data set and few variables from the second data set are being used to analyse the following research question.

Research Question

  1. What are the variables or factors that are affecting a country’s happiness?
  2. To find out which model fits/accurately predicts the happiness score.

Hypothesis

I wish to test the following hypothesis,

  1. Higher/Better GDP per capital leads to a country’s happiness.
  2. Urbanization leads to a country’s happiness.
  3. Higher Life Expectancy leads to country’s happiness.

This study aims to test the above hypothesis while comparing the GDP per capital and Urbanization with other variables including: Population growth, Life Expectancy, Sex-Ratio and Suicide Rate.

Reading & Preparing the data sets

Code
#dataset-1
happiness <- read_csv("project datasets/world-happiness-report-2021.csv")
Rows: 149 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
hap_sc <- select(happiness, "Country name", "Ladder score")
hap_sc <- rename(hap_sc, "Country"="Country name", "Happiness"="Ladder score")
colnames(hap_sc)
[1] "Country"   "Happiness"
Code
#datasets from world data by country
GDP_per_capita <- read_csv("project datasets/GDP per capita.csv")
Rows: 191 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): GDP per capita

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
Urbanization_rate <- read_csv("project datasets/Urbanization rate.csv")
Rows: 218 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Urbanization rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
Population_growth <- read_csv("project datasets/Population growth.csv")
Rows: 207 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Population growth

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
Life_expectancy <- read_csv("project datasets/Life expectancy.csv")
Rows: 185 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Life expectancy

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
Suicide_rate <- read_csv("project datasets/Suicide rate.csv")
Rows: 182 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Suicide rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
Sex_ratio <- read_csv("project datasets/Sex-ratio.csv")
Rows: 226 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Sex-ratio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#merging the above datasets by country and iso code

world_data <- merge(GDP_per_capita, Urbanization_rate)
world_data <- merge(world_data, Population_growth)
world_data <- merge(world_data, Life_expectancy)
world_data <- merge(world_data, Suicide_rate)
world_data <- merge(world_data, Sex_ratio)



#merging happiness scores and world data
happy <- merge(world_data, hap_sc)
happy <- rename(happy, "GDP.per.capita"="GDP per capita", "urbanization.rate"="Urbanization rate", "Population.growth"="Population growth", "Life.expectancy"="Life expectancy", "Suicide.rate"="Suicide rate", "Sex.ratio"="Sex-ratio")
str(happy)
'data.frame':   199 obs. of  9 variables:
 $ Country          : chr  "Afghanistan" "Algeria" "Argentina" "Armenia" ...
 $ ISO-code         : chr  "AFG" "DZA" "ARG" "ARM" ...
 $ GDP.per.capita   : num  2182 16091 19971 11845 54799 ...
 $ urbanization.rate: num  26 73.7 92.1 63.3 86.2 58.7 56.4 89.5 38.2 79.5 ...
 $ Population.growth: num  2.41 1.89 0.88 0.17 1.6 0.46 1.35 1.92 1.19 -0.1 ...
 $ Life.expectancy  : num  64.5 76.7 76.5 74.9 83.3 81.4 72.9 77.2 72.3 74.6 ...
 $ Suicide.rate     : num  6.4 3.3 9.1 5.7 11.7 11.4 2.6 5.7 6.1 21.4 ...
 $ Sex.ratio        : num  1.03 1.03 0.98 0.95 0.99 0.96 0.98 1.53 0.97 0.87 ...
 $ Happiness        : num  2.52 4.89 5.93 5.28 7.18 ...
Code
#adding continents data to final dataset
happy$Continent <- NA

happy$Continent[which(happy$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
                                   "Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
                                   "South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
                                   "Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
                                   "Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
                                   "Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
                                   "Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
happy$Continent[which(happy$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
                                   "Netherlands", "Sweden", "Austria", "Ireland", "Germany",
                                   "Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
                                   "Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
                                   "Russia", "Lithuania", "Latvia", "Moldova", "Romania",
                                   "Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
                                   "Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
                                   "Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
                                   "Bulgaria", "Albania", "Ukraine"))] <- "Europe"
happy$Continent[which(happy$Country %in% c("Canada", "Costa Rica", "United States", "Mexico",  
                                   "Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
                                   "Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
                                   "Haiti"))] <- "North America"
happy$Continent[which(happy$Country %in% c("Chile", "Brazil", "Argentina", "Uruguay",
                                   "Colombia", "Ecuador", "Bolivia", "Peru",
                                   "Paraguay", "Venezuela"))] <- "South America"
happy$Continent[which(happy$Country %in% c("New Zealand", "Australia"))] <- "Australia"
happy$Continent[which(is.na(happy$Continent))] <- "Africa"

# Moving the continent column's position in the dataset to the second column

happy <- happy %>% select(Country,Continent, everything())
happy <- happy[!duplicated(happy), ]

#Renaming the final dataframe to happy
view(happy)
happy <- happy[-(48:110), ]
str(happy)
'data.frame':   136 obs. of  10 variables:
 $ Country          : chr  "Afghanistan" "Algeria" "Argentina" "Armenia" ...
 $ Continent        : chr  "Asia" "Africa" "South America" "Asia" ...
 $ ISO-code         : chr  "AFG" "DZA" "ARG" "ARM" ...
 $ GDP.per.capita   : num  2182 16091 19971 11845 54799 ...
 $ urbanization.rate: num  26 73.7 92.1 63.3 86.2 58.7 56.4 89.5 38.2 79.5 ...
 $ Population.growth: num  2.41 1.89 0.88 0.17 1.6 0.46 1.35 1.92 1.19 -0.1 ...
 $ Life.expectancy  : num  64.5 76.7 76.5 74.9 83.3 81.4 72.9 77.2 72.3 74.6 ...
 $ Suicide.rate     : num  6.4 3.3 9.1 5.7 11.7 11.4 2.6 5.7 6.1 21.4 ...
 $ Sex.ratio        : num  1.03 1.03 0.98 0.95 0.99 0.96 0.98 1.53 0.97 0.87 ...
 $ Happiness        : num  2.52 4.89 5.93 5.28 7.18 ...

The final data set “happy” consists of 12 variables and 136 observations. We can start the analysis now.

Exploratory Data Analysis

Understanding the correlation between happiness scores and other variables.

Code
# Create a correlation plot
ggcorr(happy, label = TRUE, label_round = 2, label_size = 3.5, size = 4, hjust = .85) +
  ggtitle("Plot- 1: Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.5))
Warning in ggcorr(happy, label = TRUE, label_round = 2, label_size = 3.5, : data
in column(s) 'Country', 'Continent', 'ISO-code' are not numeric and were ignored

Observations from the correlation plot:

The happiness score is,

  • Strongly correlating with GDP per capita, Urbanization rate and Life expectancy.

  • Inversely correlated with suicidal rate and population growth.

  • No correlation with sex ratio.

Code
#plotting the happiness data on world map

plot_2 <- select(happy, "Continent", "Country", "Happiness")

library(maps)

Attaching package: 'maps'
The following object is masked from 'package:purrr':

    map
Code
world.data <- map_data("world")
map.world_joined <- left_join(world.data, plot_2, by = c('region'='Country'))

map1 <- ggplot(map.world_joined, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Happiness), color="black")

map2 <- map1 + scale_fill_gradient(name = "Happiness Score", low = "yellow", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_blank(), 
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        rect = element_blank())
map2

North America, Europe and Australia appear to be the happiest.

South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.

South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.

Happiness score comparison on different continents

Code
####### Happiness score for each continent

gg1 <- ggplot(happy,
              aes(x=Continent,
                  y=Happiness,
                  color=Continent))+
  geom_point() + theme_bw() +
  theme(axis.title = element_text(family = "Helvetica", size = (8)))

gg2 <- ggplot(happy , aes(x = Continent, y = Happiness)) +
  geom_boxplot(aes(fill=Continent)) + theme_bw() +
  theme(axis.title = element_text(family = "Helvetica", size = (8)))

gg3 <- ggplot(happy,aes(x=Continent,y=Happiness))+
  geom_violin(aes(fill=Continent),alpha=0.7)+ theme_bw() +
  theme(axis.title = element_text(family = "Helvetica", size = (8)))

# Compute descriptive statistics by groups
stable <- desc_statby(happy, measure.var = "Happiness",
                      grps = "Continent")
stable <- stable[, c("Continent","mean","median")]
names(stable) <- c("Continent", "Mean of happiness score","Median of happiness score")
# Summary table plot
stable.p <- ggtexttable(stable,rows = NULL, 
                         theme = ttheme("classic"))


ggarrange(gg1, gg2, ncol = 1, nrow = 2)

Code
ggarrange(gg3, stable.p, ncol = 1, nrow = 2)

As we have seen before, Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.

Let’s see the correlation between happiness score and 3 major factors in the happiness data set that are GDP per capita, Life Expectancy, Urbanization rate to understand how they are varying.

Code
ggplot(subset(happy, happy$Continent != "Australia"), aes(x =  GDP.per.capita, y = Happiness)) + 
  geom_point(aes(color=Continent), size = 3, alpha = 0.8) +  
  geom_smooth(aes(color = Continent, fill = Continent), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Continent) +
  theme_bw() + labs(title = "Scatter plot with regression line Happiness vs GDP per capita")
`geom_smooth()` using formula = 'y ~ x'

The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents. Worth mentioning that we will not take Australia into account because there are just two countries in Australia and creating scatter plot with the regression line for this continent will not give us any insight.

Code
ggplot(subset(happy, happy$Continent != "Australia"), aes(x =  urbanization.rate, y = Happiness)) + 
  geom_point(aes(color=Continent), size = 3, alpha = 0.8) +  
  geom_smooth(aes(color = Continent, fill = Continent), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Continent) +
  theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Urbanization rate")
`geom_smooth()` using formula = 'y ~ x'

The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.

Code
ggplot(subset(happy, happy$Continent != "Australia"), aes(x =  Life.expectancy, y = Happiness)) + 
  geom_point(aes(color=Continent), size = 3, alpha = 0.8) +  
  geom_smooth(aes(color = Continent, fill = Continent), 
              method = "lm", fullrange = TRUE) +
  facet_wrap(~Continent) +
  theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Life Expectancy")
`geom_smooth()` using formula = 'y ~ x'

The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.

Modeling & Hypothesis testing

From the above visualizations and analysis, we have found that GDP per capita, Urbanization and Life Expectancy play major role in the happiness score. Whereas suicide rate and population growth has an inverse relationship. Let’s build a model for estimating happiness scores based on these control variables and see which of these has effect on the overall happiness scores.

Model-1

Code
#Model-1
model1 <- lm(Happiness ~ GDP.per.capita + urbanization.rate + Life.expectancy, data = happy)
summary(model1)

Call:
lm(formula = Happiness ~ GDP.per.capita + urbanization.rate + 
    Life.expectancy, data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89896 -0.37351  0.07498  0.48488  1.15628 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4.510e-01  7.280e-01   0.620   0.5366    
GDP.per.capita    1.790e-05  3.867e-06   4.628 8.72e-06 ***
urbanization.rate 6.992e-03  3.646e-03   1.918   0.0573 .  
Life.expectancy   5.814e-02  1.156e-02   5.030 1.57e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6355 on 132 degrees of freedom
Multiple R-squared:  0.6693,    Adjusted R-squared:  0.6618 
F-statistic: 89.07 on 3 and 132 DF,  p-value: < 2.2e-16

It can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor/control variables ( GDP per capita, Urbanization rare and Life Expectancy) is significantly related to the outcome variable (Happiness score).

It can be seen that, changing in GDP per capita and Life expectancy are significantly associated to changes in happiness scores while changes in urbanization rate is not significantly associated with sales.

Code
summary(model1)$coefficient
                      Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)       4.510470e-01 7.280370e-01 0.6195386 5.366295e-01
GDP.per.capita    1.789683e-05 3.867139e-06 4.6279250 8.717287e-06
urbanization.rate 6.992347e-03 3.645755e-03 1.9179423 5.727811e-02
Life.expectancy   5.814042e-02 1.155817e-02 5.0302459 1.567648e-06
Code
confint(model1)
                          2.5 %       97.5 %
(Intercept)       -9.890821e-01 1.891176e+00
GDP.per.capita     1.024724e-05 2.554641e-05
urbanization.rate -2.193158e-04 1.420401e-02
Life.expectancy    3.527723e-02 8.100362e-02

Lets see how this model behaves if we apply log on the variables

Code
#Model-1 with log
model1_log <- lm(Happiness ~ log(GDP.per.capita + urbanization.rate + Life.expectancy), data = happy)
summary(model1_log)

Call:
lm(formula = Happiness ~ log(GDP.per.capita + urbanization.rate + 
    Life.expectancy), data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32889 -0.48506  0.05291  0.51078  1.36460 

Coefficients:
                                                          Estimate Std. Error
(Intercept)                                               -1.66107    0.49628
log(GDP.per.capita + urbanization.rate + Life.expectancy)  0.75478    0.05177
                                                          t value Pr(>|t|)    
(Intercept)                                                -3.347  0.00106 ** 
log(GDP.per.capita + urbanization.rate + Life.expectancy)  14.581  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.682 on 134 degrees of freedom
Multiple R-squared:  0.6134,    Adjusted R-squared:  0.6105 
F-statistic: 212.6 on 1 and 134 DF,  p-value: < 2.2e-16

The R-Squared value drops from 0.6693 to 0.6134 by applying log for model-1.

Model-2

Code
#Model-2
model2  <- lm(Happiness ~ GDP.per.capita + Life.expectancy, data = happy)
summary(model2)

Call:
lm(formula = Happiness ~ GDP.per.capita + Life.expectancy, data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.01976 -0.39026  0.09436  0.49252  1.18774 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.683e-01  7.201e-01   0.234    0.816    
GDP.per.capita  2.028e-05  3.699e-06   5.481 2.05e-07 ***
Life.expectancy 6.713e-02  1.067e-02   6.292 4.20e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6418 on 133 degrees of freedom
Multiple R-squared:  0.6601,    Adjusted R-squared:  0.655 
F-statistic: 129.2 on 2 and 133 DF,  p-value: < 2.2e-16
Code
summary(model2)$coefficient
                    Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)     1.683136e-01 7.200997e-01 0.2337366 8.155489e-01
GDP.per.capita  2.027701e-05 3.699299e-06 5.4813121 2.046209e-07
Life.expectancy 6.713497e-02 1.066984e-02 6.2920307 4.201007e-09

It turns out Life Expectancy is strongly associated with happiness scores and GDP per capita follows life expectancy.

Code
confint(model2)
                        2.5 %       97.5 %
(Intercept)     -1.256016e+00 1.592643e+00
GDP.per.capita   1.295994e-05 2.759408e-05
Life.expectancy  4.603044e-02 8.823951e-02

However the R-squared values of Model-1 and Model-2 are approximately around 65-66%. Hence we cannot say which model is best fit to predict the happiness scores.

Lets consider all the 5 control variables and see the results.

Lets see how model-2 behaves if we apply log on the variables

Code
#Model-2 with log
model2_log  <- lm(Happiness ~ log(GDP.per.capita + Life.expectancy), data = happy)
summary(model2_log)

Call:
lm(formula = Happiness ~ log(GDP.per.capita + Life.expectancy), 
    data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32936 -0.47756  0.04899  0.51212  1.36194 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           -1.62336    0.49427  -3.284   0.0013 ** 
log(GDP.per.capita + Life.expectancy)  0.75129    0.05158  14.565   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6824 on 134 degrees of freedom
Multiple R-squared:  0.6129,    Adjusted R-squared:   0.61 
F-statistic: 212.1 on 1 and 134 DF,  p-value: < 2.2e-16

The R-Squared value drops from 0.66 to 0.61 by applying log for model-2.

Model-3

Code
#Model-3
model3 <- lm(Happiness ~ GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth , data = happy)
summary(model3)

Call:
lm(formula = Happiness ~ GDP.per.capita + urbanization.rate + 
    Life.expectancy + Suicide.rate + Population.growth, data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87156 -0.39252  0.05248  0.45831  1.25675 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -2.712e-02  1.268e+00  -0.021 0.982977    
GDP.per.capita     1.646e-05  4.281e-06   3.845 0.000188 ***
urbanization.rate  6.826e-03  3.647e-03   1.872 0.063456 .  
Life.expectancy    6.341e-02  1.667e-02   3.804 0.000218 ***
Suicide.rate       1.586e-02  1.317e-02   1.204 0.230615    
Population.growth -1.622e-02  7.576e-02  -0.214 0.830844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6348 on 130 degrees of freedom
Multiple R-squared:  0.6751,    Adjusted R-squared:  0.6626 
F-statistic: 54.01 on 5 and 130 DF,  p-value: < 2.2e-16

The R-squared remains same after adding additional two control variables. So far, the second model fits the happiness scores.

Lets see how model-3 behaves if we apply log on the control variables

Code
#Model-3 with log
model3_log <- lm(Happiness ~ log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) , data = happy)
summary(model3_log)

Call:
lm(formula = Happiness ~ log(GDP.per.capita + urbanization.rate + 
    Life.expectancy + Suicide.rate + Population.growth), data = happy)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32855 -0.48525  0.05338  0.51003  1.36226 

Coefficients:
                                                                                             Estimate
(Intercept)                                                                                  -1.67866
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth)  0.75648
                                                                                             Std. Error
(Intercept)                                                                                     0.49727
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth)    0.05186
                                                                                             t value
(Intercept)                                                                                   -3.376
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth)  14.587
                                                                                             Pr(>|t|)
(Intercept)                                                                                  0.000963
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth)  < 2e-16
                                                                                                
(Intercept)                                                                                  ***
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6818 on 134 degrees of freedom
Multiple R-squared:  0.6136,    Adjusted R-squared:  0.6107 
F-statistic: 212.8 on 1 and 134 DF,  p-value: < 2.2e-16

The R-squared values drops from 0.66 to 0.61 by application of log.

Model comprision

Models Multiple R-squared Adjusted R-squared Significant Variables
Model-1 (GDP per capita+Urbanization rate+Life expectancy) 0.6693 0.6618 GDP per capita & Life Expectancy
Model-1 (Log(GDP per capita+Urbanization rate+Life expectancy)) 0.6134 0.6105 -
Model-2 (GDP per capita+Life expectancy) 0.6601 0.655

GDP per capita & Life Expectancy.

Life expectancy > GDP per capita

Model-2 (Log(GDP per capita+Life expectancy)) 0.6129 0.61 -
Model-3 (GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth) 0.6601 0.655 GDP per capita & Life Expectancy
Model-3 (Log(GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth)) 0.6136 0.6107

GDP per capita & Life Expectancy.

Life expectancy > GDP per capita

  • It can be derived from the above table that the adjusted R-Squared for Model-2 is better than other two models, hence it can be said that Model-2 is a good fit.

  • Application of Logarithm on the control variables reduced the R-squared values in all the 3 models.

  • Hypothesis testing conclusion:

    Turns out we have evidence for all the 3 hypothesis but,

    • We have a strong evidence for hypothesis-3 i.e. Higher Life Expectancy leads to country’s happiness.

------

As we have concluded that Model-2 is a goof fit, Lets divide the happy dataset into training and test and plot the predicted happiness scores for Model-2.

Code
# Splitting the dataset into the Training set and Test set
set.seed(123)
dataset <- happy[4:10]
split = sample.split(dataset$Happiness, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Code
# Fitting Multiple Linear Regression to the Training set on Model-2
regressor_lm = lm(formula = Happiness ~ GDP.per.capita + Life.expectancy,
               data = training_set)

summary(regressor_lm)

Call:
lm(formula = Happiness ~ GDP.per.capita + Life.expectancy, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9617 -0.4055  0.0248  0.4922  1.1547 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -6.041e-02  7.946e-01  -0.076     0.94    
GDP.per.capita   2.117e-05  4.330e-06   4.889 3.65e-06 ***
Life.expectancy  6.975e-02  1.184e-02   5.892 4.65e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6397 on 105 degrees of freedom
Multiple R-squared:  0.6834,    Adjusted R-squared:  0.6774 
F-statistic: 113.3 on 2 and 105 DF,  p-value: < 2.2e-16
Code
y_pred_lm = predict(regressor_lm, newdata = test_set)

Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = test_set$Happiness))

gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
  geom_point() + theme_bw() + geom_abline() +
  labs(title = "Multiple Linear Regression", x = "Actual happiness score",
       y = "Predicted happiness score") +
  theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)), 
        axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm

The Model shows 67% accuracy on the test data.

Regression Diagnostics

Model-1

Code
#Model-1
par(mfrow = c(2,3)); plot(model1, which = 1:6)

Model-1 Diagnostics:

Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.

Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.

Scale-Location: The line is approximately horizontal, so I can say that the constant variance assumption is met.

Cook’s distance: No observation is greater than 1. The maximum cook’s distance is <0.3. 174th observation is the influential.

Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is low.

But the influential observation 174 is not at the same place in all graphs which is odd.

Model-2

Code
#Model-2
par(mfrow = c(2,3)); plot(model2, which = 1:6)

Model-2 Diagnostics:

Residuals vs Fitted: I would say the linearity assumption is violated here but the constant variance of residuals is present.

Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.

Scale-Location: Compared to model-1, the line is almost horizontal. This one looks good.

Cook’s distance: No observation is greater than 1. The maximum cook’s distance is 0.9 (approximately). The 156th observation is very influential.

Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is high as compared to model-1 which is good.

I can say that the regression diagnostics of Model-2 look much better than Model-1 because of the larger cook’s distance and leverage.

Model-3

Code
#Model-3
par(mfrow = c(2,3)); plot(model3, which = 1:6)

Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.

Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.

Scale-Location: Compared to model-1 & 2, the line is not horizontal.

Cook’s distance: No observation is greater than 1. The maximum cook’s distance is < 0.3 (approximately). Again as similar to model-1, the 174th observation is very influential.

Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is low.

But the influential observation 174 is not at the same place in all graphs which is odd.

Conclusion

The aim of this study is to find out which factors significantly effect world happiness scores and to find out a model that fits/predicts the happiness scores. The following is the summary of this study.

  • GDP per capita, Urbanization rate and Life expectancy are the factors that are significantly associated with Happiness scores.

  • North America, Europe and Australia appear to be the happiest.

  • South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.

  • South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.

  • Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.

  • The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents.

  • The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.

  • The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.

When it comes to modeling and hypothesis testing,

  • Model-1 concluded that GDP per capita and Life Expectancy are strongly associated with the happiness scores.

  • Model-2 concluded that Life Expectancy is more strongly associated with the happiness scores, giving evidence to the third hypothesis.

  • Regression diagnostics of model-2 showed larger cook distance and leverage.

References

  1. https://www.kaggle.com/code/noobiedatascientist/explaining-happiness/data

  2. https://worldhappiness.report/ed/2021/

  3. http://www.sthda.com/english/articles/40-regression-analysis/168-multiple-linear-regression-in-r/#building-model

  4. https://www.guru99.com/r-simple-multiple-linear-regression.html